Abstract

Clustering is a fundamental task in unsupervised learning. The focus of this paper is the Correlation Clustering functional, which combines positive and negative affinities between the data points. The contribution of this paper is twofold:

(i) A theoretical analysis of the functional.

(ii) New optimization algorithms that can cope with large scale problems (> 100K variables) that are infeasible using existing methods.

Our theoretical analysis provides a probabilistic generative interpretation for the functional and justifies its intrinsic "model-selection" capability. Furthermore, we draw an analogy between optimizing this functional and the well-known Potts energy minimization. This analogy allows us to suggest several new optimization algorithms, which exploit the intrinsic "model-selection" capability of the functional to automatically recover the underlying number of clusters. We compare our algorithms to existing methods on both synthetic and real data. In addition, we suggest two new applications that are made possible by our algorithms: unsupervised face identification and interactive multi-object segmentation by rough boundary delineation.



1 Introduction 

One of the fundamental tasks in unsupervised learning is clustering: grouping data points into coherent clusters. In clustering of data points, two aspects of pair-wise affinities can be measured: (i) Attraction (positive affinities), i.e., how likely are points i and j to be in the same cluster, and (ii) Repulsion (negative affinities), i.e., how likely are points i and j to be in different clusters.

Indeed, recent approaches to clustering, presented by Yu and Shi [2001] and Bansal et al., suggest combining attraction and repulsion information. Normalized cuts was extended by Yu and Shi [2001] to allow for negative affinities. However, the resulting functional provides sub-optimal clustering results in the sense that it may lead to fragmentation of large homogeneous clusters.

The Correlation Clustering functional (CC), proposed by Bansal et al., tries to maximize the intra-cluster agreement (attraction) and the inter-cluster disagreement (repulsion). Contrary to many clustering objectives, the CC functional has an inherent "model-selection" property that allows it to automatically recover the underlying number of clusters [Demaine and Immorlica].

Optimizing CC is tightly related to many graph partitioning formulations [Nowozin and Jegelka 2009]; however, it is known to be NP-hard [Bansal et al.]. Existing methods derive convex continuous relaxations to approximately optimize the CC functional. However, these algorithms do not scale beyond a few hundred variables. See, for example, the works of [Nowozin and Jegelka 2009; Bagon et al.; Vitaladevuni and Basri 2010; Glasner et al. 2011].

This work suggests a new perspective on the CC functional, showing its analogy to the known Potts model. This new perspective allows us to leverage recent advances in discrete optimization to propose new CC optimization algorithms. We show that our algorithms scale to a large number of variables (> 100K), and in fact can tackle tasks that were infeasible in the past, e.g., applying CC to pixel-level image segmentation. In addition, we provide a rigorous statistical interpretation for the CC functional and justify its intrinsic model selection capability. Our algorithms exploit this "model selection" property to automatically recover the underlying number of clusters k.

The contributions of this paper are as follows: 

• A rigorous probabilistic interpretation of the CC func- 
tional, justifying its intrinsic model selection capability. 

• A new perspective on the functional, drawing an analogy to the discrete Potts model.

• New large scale optimization algorithms that stem from our new perspective.

• Our algorithms automatically recover the underlying number of clusters k.

• New applications in vision and graphics. 

The first part of the paper (Sec. 2) focuses on the theoretical probabilistic interpretation of the CC functional. The subsequent sections are dedicated to the second part of this work, which concerns the optimization of the CC functional.

Correlation Clustering (CC) Functional 

Let $W \in \mathbb{R}^{n \times n}$ be an affinity matrix combining attraction and repulsion: for $W_{ij} > 0$ we say that i and j attract each other with certainty $|W_{ij}|$, and for $W_{ij} < 0$ we say that i and j repel each other with certainty $|W_{ij}|$. Thus the sign of $W_{ij}$ tells us if the points attract or repel each other and the magnitude of $W_{ij}$ indicates our certainty.

Any k-way partition of n points can be written as $U \in \{0,1\}^{n \times k}$ s.t. $U_{ic} = 1$ iff point i belongs to cluster c. The constraint $\sum_c U_{ic} = 1\ \forall i$ ensures that every i belongs to exactly one cluster.



The CC functional maximizes the intra-cluster agreement [Bansal et al.]. Given a matrix W, an optimal partition U minimizes:

$$\mathcal{E}_{CC}(U) = -\sum_{ij} W_{ij} \sum_c U_{ic} U_{jc} \qquad (1)$$

$$\text{s.t. } U_{ic} \in \{0,1\},\quad \sum_c U_{ic} = 1$$

Note that $\sum_c U_{ic} U_{jc}$ equals 1 iff i and j belong to the same cluster. For brevity, we will denote $\sum_c U_{ic} U_{jc}$ by $[UU^T]_{ij}$ from here on.
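As a concrete illustration of Eq. (1), the following minimal sketch evaluates the CC energy of a partition given as a label vector (one cluster index per point) instead of the indicator matrix U. The function name `cc_energy` and the toy matrix are illustrative assumptions, not part of the original work.

```python
def cc_energy(W, labels):
    """Eq. (1): minus the sum of W[i][j] over all pairs (i, j) in the same cluster."""
    n = len(labels)
    return -sum(W[i][j]
                for i in range(n) for j in range(n)
                if labels[i] == labels[j])


# Hypothetical toy affinities: points 0 and 1 attract, both repel point 2.
W = [[0, 2, -1],
     [2, 0, -3],
     [-1, -3, 0]]

for labels in ([1, 1, 2], [1, 1, 1], [1, 2, 3]):
    print(labels, cc_energy(W, labels))
```

On this toy matrix the partition {0, 1}, {2} attains the lowest energy of the three candidates, since it keeps the attracting pair together and separates the repelling ones.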

2 Probabilistic Interpretation 

This section provides a probabilistic interpretation for the 
CC functional. This interpretation allows us to provide a 
theoretic justification for the "model selection" property of 
the CC functional. Moreover, our analysis exposes the un- 
derlying implicit prior that this functional assumes. 

We consider the following probabilistic generative model for matrix W. Let U be the true unobserved partition of n points into clusters. Assume that for some pairs of points i, j we observe their pairwise similarity values $s_{ij}$. These values are random realizations from either a distribution $f^+$ or $f^-$, depending on whether points i, j are in the same cluster or not. Namely,

$$p(s_{ij} = s \mid [UU^T]_{ij} = 1) = f^+(s)$$

$$p(s_{ij} = s \mid [UU^T]_{ij} = 0) = f^-(s)$$

Footnote 1: Note that W may be sparse. The "missing" entries are simply assigned "zero certainty" and therefore they do not affect the optimization.



Assuming independence of the pairs, the likelihood of observing similarities $\{s_{ij}\}$ given a partition U is then

$$\mathcal{L}(\{s_{ij}\} \mid U) = \prod_{ij} f^+(s_{ij})^{[UU^T]_{ij}} \cdot f^-(s_{ij})^{(1 - [UU^T]_{ij})}$$



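To infer a partition U using this generative model we look at the posterior distribution:

$$\Pr(U \mid \{s_{ij}\}) \propto \mathcal{L}(\{s_{ij}\} \mid U) \cdot \Pr(U)$$

where $\Pr(U)$ is a prior. Assuming a uniform prior over all partitions, i.e., $\Pr(U) = \text{const}$, yields:

$$\Pr(U \mid \{s_{ij}\}) \propto \prod_{ij} f^+(s_{ij})^{[UU^T]_{ij}} \cdot f^-(s_{ij})^{(1 - [UU^T]_{ij})}$$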

Then, the negative logarithm of the posterior is given by

$$-\log \Pr(U \mid \{s_{ij}\}) = C - \sum_{ij} \log f^+(s_{ij})\,[UU^T]_{ij} - \sum_{ij} \log f^-(s_{ij})\,(1 - [UU^T]_{ij})$$

where C is a constant not depending on U.

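Interpreting the affinities as log odds ratios $W_{ij} = \log\left(\frac{f^+(s_{ij})}{f^-(s_{ij})}\right)$, the posterior becomes

$$-\log \Pr(U \mid \{s_{ij}\}) = C' - \sum_{ij} W_{ij}\,[UU^T]_{ij} \qquad (2)$$

where the constant $C'$ additionally absorbs the U-independent term involving $\sum_{ij} \log f^-(s_{ij})$.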

That is, Eq. (2) gives the negative log-posterior of a partition U. Therefore, a partition U that minimizes Eq. (2) is the MAP (maximum a-posteriori) partition. Since Eq. (1) and Eq. (2) differ only by a constant they share the same minimizer: the MAP partition.
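The step from the negative log-posterior to Eq. (2) can be spelled out explicitly. The shorthand $X_{ij} = [UU^T]_{ij}$ and the symbol $C'$ for the collected constant are used here only for this derivation:

```latex
\begin{align*}
-\log \Pr(U \mid \{s_{ij}\})
  &= C - \sum_{ij} X_{ij}\,\log f^{+}(s_{ij})
       - \sum_{ij} (1 - X_{ij})\,\log f^{-}(s_{ij}) \\
  &= \underbrace{C - \sum_{ij} \log f^{-}(s_{ij})}_{C'}
       - \sum_{ij} X_{ij}\,\log\frac{f^{+}(s_{ij})}{f^{-}(s_{ij})} \\
  &= C' - \sum_{ij} W_{ij}\,X_{ij}.
\end{align*}
```

Only the last sum depends on the partition, which is why minimizing Eq. (1) and maximizing the posterior select the same U.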

2.1 Recovering k (a.k.a. "model selection") 

We showed that the generative model underlying the CC functional has a single model for all partitions, regardless of k. Therefore, when optimizing the CC functional one need not select between different generative models to decide on the optimal k. Comparing partitions with different k is therefore straightforward and does not require an additional "model complexity" term (such as BIC, MDL, etc.).

As described in the previous section, the CC functional assumes a uniform prior over all partitions. This uniform prior on U induces a prior on the number of clusters k, i.e., the a-priori probability of U having k clusters: $\Pr(k) = \Pr(U \text{ has } k \text{ clusters})$. We use Stirling numbers of the second kind [Rennie and Dobson 1969] to compute this induced prior on k. Fig. 1 shows the non-trivial shape of this induced prior on the number of clusters k.
[Figure 1 plot: $-\log \Pr(k)$ versus the number of clusters k; horizontal axis marked at k = 1, n/log n and k = n.]

Figure 1: Prior on the number of clusters k: Graph shows $-\log \Pr(k)$ for uniformly distributed U. The induced prior on k takes a non-trivial shape: it assigns very low probability to the trivial solutions of k = 1 and k = n, while at the same time giving preference to partitions with non-trivial k. The mode of this prior is when U has roughly n/log n clusters.
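The induced prior on k can be reproduced numerically. The sketch below counts the partitions of n points into k clusters with the standard recurrence for Stirling numbers of the second kind, S(m, k) = k S(m-1, k) + S(m-1, k-1), and normalizes by their sum. The function names and the choice n = 100 are assumptions made for illustration only.

```python
from math import log


def stirling2_row(n):
    """Return [S(n, 0), ..., S(n, n)], Stirling numbers of the second kind."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            same_k = row[k] if k < len(row) else 0  # S(m-1, k), zero when k = m
            new[k] = k * same_k + row[k - 1]        # recurrence for S(m, k)
        row = new
    return row


def prior_on_k(n):
    """Pr(k) induced by a uniform prior over all partitions of n points."""
    counts = stirling2_row(n)
    total = sum(counts)  # total number of partitions of n points
    return [c / total for c in counts]


n = 100
prior = prior_on_k(n)
mode = max(range(1, n + 1), key=lambda k: prior[k])
print("mode of the prior:", mode, " n/log n:", n / log(n))
print("-log Pr(k=1):", -log(prior[1]), " -log Pr(k=n):", -log(prior[n]))
```

Python integers have arbitrary precision, so the very large counts involved for n = 100 are handled exactly before the final division.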



3 CC Optimization: Continuous Perspective 

After discussing the theory behind the CC functional and 
providing probabilistic justification to its model selection 
capability, we move to discuss methods and approaches to 
optimize this functional. We begin with a brief glance at 
current state-of-the-art CC optimization algorithms. 

Optimizing the correlation clustering functional (Eq. (1)) is NP-hard [Bansal et al.]. Instead of solving directly for a partition U, existing methods optimize indirectly for the binary adjacency matrix $X = UU^T$, i.e., $X_{ij} = 1$ iff i and j belong to the same cluster. By introducing the binary adjacency matrix, the quadratic objective (w.r.t. U), $-\sum_{ij} W_{ij} [UU^T]_{ij}$, becomes linear (w.r.t. X): $-\sum_{ij} W_{ij} X_{ij}$. The connected components of X, after proper rounding, are the resulting clusters, and the number of clusters k naturally emerges. Indirect optimization methods must ascertain that the feasible set consists only of "decomposable" X: $X = UU^T$. This may be achieved either by posing semi-definite constraints on X [Vitaladevuni and Basri 2010], or by introducing a large number of linear inequalities [Demaine and Immorlica; Vitaladevuni and Basri 2010]. These methods take a continuous and convex relaxation approach to approximate the resulting functional. This approach allows for nice theoretical properties due to the convex optimization, at the cost of very restricted scalability.

Solving for X requires $\sim n^2$ variables instead of only $\sim n$ when solving directly for U. Therefore, these methods scale poorly with the number of variables n, and in fact, they cannot handle more than a few hundred variables. In summary, these methods suffer from two drawbacks: (i) recovering U from X is highly susceptible to noise, and more importantly (ii) it is infeasible to solve large scale problems by these methods.
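Both drawbacks can be seen on a toy example. The sketch below builds the adjacency matrix X from a labeling, recovers clusters as connected components of X, and shows how a single spurious entry in a rounded X merges two clusters. All names are hypothetical, and the depth-first search is a simple stand-in for whatever rounding scheme an actual relaxation method would use.

```python
def adjacency_from_labels(labels):
    """X[i][j] = 1 iff points i and j carry the same label (X = U U^T)."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def clusters_from_adjacency(X):
    """Recover a labeling as connected components of a 0/1 matrix X."""
    n = len(X)
    labels = [0] * n  # 0 means "not visited yet"
    current = 0
    for start in range(n):
        if labels[start]:
            continue
        current += 1
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if (X[i][j] or X[j][i]) and not labels[j]:
                    labels[j] = current
                    stack.append(j)
    return labels


labels = [1, 1, 2, 1]
X = adjacency_from_labels(labels)
print(len(labels), "variables for U versus", len(X) ** 2, "entries for X")
print(clusters_from_adjacency(X))

# A single wrong entry after rounding is enough to merge two clusters.
X_noisy = [row[:] for row in X]
X_noisy[0][2] = 1
print(clusters_from_adjacency(X_noisy))
```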

4 Our New Perspective on CC 

Existing methods view the CC optimization in the context of convex relaxation and build upon methods and approaches that are common practice in this field of continuous optimization. We propose an alternative perspective on the CC optimization: viewing it as a discrete energy minimization. This new perspective allows us to build upon recent advances in discrete optimization and propose efficient and direct CC optimization algorithms. More importantly, the resulting algorithms solve directly for U, and thus scale significantly better with the number of variables.

We now show how to cast the CC functional of Eq. (1) as a discrete pair-wise conditional random field (CRF) energy. For ease of notation, we describe a partition U using a labeling vector $L \in \{1, 2, \ldots\}^n$: $l_i = c$ iff $U_{ic} = 1$. A general form of pair-wise CRF energy is $E(L) = \sum_i E_i(l_i) + \sum_{ij} E_{ij}(l_i, l_j)$ [Boykov et al. 2002]. Discarding the unary term $\sum_i E_i(l_i)$, and taking the pair-wise term to be $W_{ij}$ if $l_i \neq l_j$ (and zero otherwise), we can re-write the CC functional as a CRF energy:

$$\mathcal{E}_{CC}(L) = \sum_{ij} W_{ij} \cdot \mathbb{1}_{[l_i \neq l_j]} \qquad (3)$$

This is a Potts model. It differs from Eq. (1) only by the constant $\sum_{ij} W_{ij}$, so both have the same minimizers. Optimizing the CC functional can now be interpreted as searching for a MAP assignment for the energy (3).
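The equivalence between the two forms of the functional is easy to check numerically. The following sketch evaluates both the partition form of Eq. (1) and the Potts form of Eq. (3) on a label vector and verifies that their difference is the same constant for every labeling. The function names and the toy matrix are assumptions used for illustration.

```python
def cc_energy(W, labels):
    """Eq. (1): minus the sum of W[i][j] over pairs with equal labels."""
    n = len(labels)
    return -sum(W[i][j] for i in range(n) for j in range(n)
                if labels[i] == labels[j])


def potts_energy(W, labels):
    """Eq. (3): sum of W[i][j] over pairs with different labels."""
    n = len(labels)
    return sum(W[i][j] for i in range(n) for j in range(n)
               if labels[i] != labels[j])


# Hypothetical toy affinity matrix with attraction and repulsion.
W = [[0, 2, -1],
     [2, 0, -3],
     [-1, -3, 0]]
total = sum(sum(row) for row in W)

for labels in ([1, 1, 1], [1, 1, 2], [1, 2, 1], [1, 2, 3]):
    gap = potts_energy(W, labels) - cc_energy(W, labels)
    print(labels, cc_energy(W, labels), potts_energy(W, labels), gap == total)
```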

The resulting Potts energy has three unique characteristics, 
each posing a challenge to the optimization process: 

(i) Non sub-modular: The energy is non sub-modular. The notion of sub-modularity is the discrete analogue of convexity from continuous optimization [Lovász 1983]. Optimizing a non sub-modular energy is NP-hard, even for the binary case [Rother et al. 2007].

(ii) Unknown number of labels: Most CRF energies are defined for a fixed and known number of labels. Thus, the search space is restricted to $L \in \{1, \ldots, k\}^n$. When the number of labels k is unknown, the search space is far larger and more complicated.

(iii) No unary term: There is no unary term in the energy. The unary term plays an important role in guiding the optimization process [Szeliski et al. 2008]. Moreover, a strong unary term is crucial when the energy is non sub-modular [Rother et al. 2007].

There exist examples of CRFs in the literature that share some of these characteristics (e.g., non sub-modular [Rother et al. 2007; Kolmogorov and Wainwright 2005], unknown number of labels [Isack and Boykov 2011; Bleyer et al. 2010]). Yet, to the best of our knowledge, no existing CRF exhibits all three challenges at once. More specifically, we are the first to handle a non sub-modular energy that has no unary term. Therefore, we cannot just use "off-the-shelf" Potts optimization algorithms, but rather modify and improve them to cope with the three challenges posed by the CC energy.
Algorithm 1: Expand-and-Explore

Input: Affinity matrix $W \in \mathbb{R}^{n \times n}$
Output: Labeling vector $L \in \{1, 2, \ldots\}^n$

Init $L_i \leftarrow 1$, i = 1, ..., n // initial labeling
repeat
    for $\alpha \leftarrow 1$ ; $\alpha \le \#L + 1$ ; $\alpha$++ do
        $L \leftarrow$ Expand($\alpha$)
until L is unchanged

#L denotes the number of different labels in L.
Expand($\alpha$): expanding $\alpha$ using QPBOI.
By letting $\alpha = \#L + 1$ the algorithm "expands" and explores an empty label. This may affect the number of labels #L.



Algorithm 2: Swap-and-Explore

Input: Affinity matrix $W \in \mathbb{R}^{n \times n}$
Output: Labeling vector $L \in \{1, 2, \ldots\}^n$

Init $L_i \leftarrow 1$, i = 1, ..., n // initial labeling
repeat
    for $\alpha \leftarrow 1$ ; $\alpha \le \#L$ ; $\alpha$++ do
        for $\beta \leftarrow \alpha$ ; $\beta \le \#L + 1$ ; $\beta$++ do
            $L \leftarrow$ Swap($\alpha$, $\beta$)
until L is unchanged

#L denotes the number of different labels in L.
Swap($\alpha$, $\beta$): swapping labels $\alpha$ and $\beta$ using QPBOI.
By letting $\beta = \#L + 1$ the algorithm explores a new number of clusters. This may affect the number of labels #L.



5 Our Large Scale CC Optimization 

In this section we adapt known discrete energy minimization algorithms to cope with the three challenges posed by the CC energy. We derive three CC optimization algorithms that stem from either large move making algorithms ($\alpha$-expand and $\alpha\beta$-swap [Boykov et al. 2002]), or Iterated Conditional Modes (ICM) [Besag 1986]. Our resulting algorithms scale gracefully with the number of variables n, and solve CC optimization problems that were infeasible in the past. Furthermore, our algorithms take advantage of the intrinsic model selection capability of the CC functional (Sec. 2) to robustly recover the underlying number of clusters.

5.1 Improved large move making algorithms 

Boykov et al. [2002] introduced a very effective method for multi-label energy minimization that makes large search steps by iteratively solving binary sub-problems. There are two large move making algorithms, $\alpha$-expand and $\alpha\beta$-swap, that differ by the binary sub-problem they solve. $\alpha$-expand considers for each variable whether it is better to retain its current label or flip it to label $\alpha$. The binary step of $\alpha\beta$-swap involves only variables that are currently assigned to labels $\alpha$ or $\beta$, and considers whether it is better to retain their current label or switch to the other label ($\alpha$ or $\beta$). Defined for sub-modular energies, the binary step in these algorithms is solved using graph-cut.
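To see why graph-cut alone does not suffice for the CC energy, one can write out the binary pair-wise term of a single expand move for the Potts form $W_{ij} \cdot \mathbb{1}_{[l_i \neq l_j]}$ and test the usual sub-modularity condition for binary pair-wise terms, E(0,0) + E(1,1) <= E(0,1) + E(1,0). The sketch below does this for one pair of variables; the function names are hypothetical, and the encoding (0 keeps the current label, 1 switches to $\alpha$) is a convention assumed here.

```python
def expand_pairwise_table(w, li, lj, alpha):
    """Binary pair-wise term of an expand move on label alpha.

    x = 0 keeps the current label, x = 1 switches to alpha. Entry [xi][xj]
    is w if the two labels differ after the move and 0 otherwise.
    """
    def label_after(current, x):
        return alpha if x == 1 else current

    return [[w * int(label_after(li, xi) != label_after(lj, xj))
             for xj in (0, 1)]
            for xi in (0, 1)]


def is_submodular(E):
    """Sub-modularity condition for a 2x2 binary pair-wise term."""
    return E[0][0] + E[1][1] <= E[0][1] + E[1][0]


# Two points currently in cluster 1, candidate label alpha = 2.
attract = expand_pairwise_table(2, 1, 1, 2)   # positive affinity
repel = expand_pairwise_table(-2, 1, 1, 2)    # negative affinity
print(attract, is_submodular(attract))
print(repel, is_submodular(repel))
```

In this example the attracting pair yields a sub-modular term while the repelling pair violates the condition, which is why the binary step needs a solver that tolerates non sub-modular terms.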

We propose new optimization algorithms, Expand-and-Explore and Swap-and-Explore, inspired by $\alpha$-expand and $\alpha\beta$-swap, that can cope with the challenges of the CC energy: (i) For the binary step we use a solver that handles non sub-modular energies. (ii) We incorporate "model selection" into the iterative search to recover the underlying number of clusters k. (iii) In the absence of a unary term, a good initial labeling is provided to the non sub-modular binary solver.



Binary non sub-modular energies can be approximated by an extension of graph-cuts: QPBO [Rother et al. 2007]. When the binary energy is non sub-modular, QPBO is not guaranteed to provide a labeling for all variables. Instead, it outputs only a partial labeling. How many variables are labeled depends on the number of non sub-modular pairs and the relative strength of the unary term for the specific energy. When no unary term exists in the energy, QPBO leaves most of the variables unlabeled. To circumvent this behavior we use the "improve" extension of QPBO (denoted by QPBOI): this extension is capable of improving an initial labeling to find a labeling with lower energy [Rother et al. 2007]. In the context of expand and swap algorithms, a natural initial labeling for the binary steps is the current labels of the variables; QPBOI is then used to improve on it, ensuring that the energy does not increase during iterations.

To overcome the problem of finding the number of clusters k, our algorithms do not iterate over a fixed number of labels, but explore an "empty" cluster in addition to the existing clusters in the current solution. Exploring an extra empty cluster allows the algorithms to optimize over all solutions with any number of clusters k. The fact that there is no unary term in the energy makes this straightforward to perform. Alg. 1 and Alg. 2 present our Expand-and-Explore and Swap-and-Explore algorithms in more detail.



5.2 Adaptive-label ICM 

Another discrete energy minimization method that we modified to cope with the three challenges of the CC optimization is ICM [Besag 1986]. It is a point-wise greedy search algorithm. Iteratively, each variable is assigned the label that minimizes the energy, conditioned on the current labels of all the other variables. ICM is commonly used for MAP estimation of energies with a fixed number of labels. Here we present an adaptive-label ICM: using the ICM conditional iterations we adaptively determine the number of labels k. Conditioned on the current labeling, we assign each point to the cluster it is most attracted to, or to a singleton cluster if it is repelled by all.
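A minimal sketch of the adaptive-label ICM update described above is given below. It is not the authors' implementation: the single-cluster initialization, the sweep limit `max_sweeps`, the strict-improvement rule and the final relabeling are assumptions made so that the example is self-contained and guaranteed to terminate.

```python
def adaptive_label_icm(W, max_sweeps=100):
    """Greedy point-wise minimization of the CC energy with a free number of labels."""
    n = len(W)
    labels = [0] * n   # assumption: start with all points in one cluster
    next_label = 1     # a label that is guaranteed to be unused
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            # score[c]: total affinity between point i and the other members of c
            score = {}
            for j in range(n):
                if j != i:
                    score[labels[j]] = score.get(labels[j], 0) + W[i][j] + W[j][i]
            current = score.get(labels[i], 0)  # 0 if i is already a singleton
            best_label, best = next_label, 0   # option: open a new singleton cluster
            for c, s in score.items():
                if s > best:
                    best_label, best = c, s
            if best > current:                 # accept strict improvements only
                labels[i] = best_label
                if best_label == next_label:
                    next_label += 1
                changed = True
        if not changed:
            break
    # Relabel the clusters as 1..k in order of first appearance.
    remap = {}
    return [remap.setdefault(c, len(remap) + 1) for c in labels]


# Hypothetical toy matrix: {0, 1} and {2, 3} attract internally, repel across.
W = [[0, 1, -1, -1],
     [1, 0, -1, -1],
     [-1, -1, 0, 1],
     [-1, -1, 1, 0]]
print(adaptive_label_icm(W))
```

Each accepted move strictly lowers the CC energy, so the sweeps stop after finitely many changes; the number of distinct labels in the output is the recovered k.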



In this section we proposed a new perspective on CC optimization. Interpreting it as MAP estimation of Potts energy allows us to propose a variety of efficient optimization methods (footnote 2):

• Swap-and-Explore (with binary step using QPBOI)

• Expand-and-Explore (with binary step using QPBOI)

• Adaptive-label ICM

Our proposed approach has the following advantages: 

(i) It solves only for n integer variables. This is far fewer than the number of variables required by the existing methods described in Sec. 3, which require $\sim n^2$ variables for the adjacency matrix $X = UU^T$. This makes our approach capable of dealing with a large number of variables (> 100K) and suitable for pixel-level image segmentation.

(ii) The algorithms solve directly for the cluster member- 
ship of each point, thus there is no need for rounding 
scheme to extract U from the adjacency matrix X. 

(iii) The number of clusters k is optimally determined by 
the algorithm and it does not have to be externally supplied 
like in many other clustering/segmentation methods. 



In their work, Elsner and Schudy [2009] proposed a greedy algorithm to optimize the CC functional over complete graphs. Their algorithm is in fact an ICM method presented outside the proper context of CRF energy minimization, and thus does not readily generalize to more recent discrete optimization methods.



6 Experimental Results 



This section evaluates the performance of our proposed optimization algorithms using both synthetic and real data. We compare to both existing discrete optimization algorithms that can handle multi-label non sub-modular energies (TRW-S [Kolmogorov and Wainwright 2005] and BP [Pearl 1988]; see footnote 3), and to the existing state-of-the-art CC optimization method of Vitaladevuni and Basri [2010]. Since existing CC optimization methods do not scale beyond several hundred variables, only small matrices are used in the following experiments. We leave it to Sec. 7 to evaluate our method on large scale problems.



6.1 Synthetic data 

This experiment uses synthetic affinity matrices W to compare our algorithms to existing Potts optimization algorithms. The synthetic data have 750 variables randomly assigned to 15 clusters of different sizes (ratio between the largest and the smallest cluster: ~×5). For each variable we sampled roughly the same number of neighbors, of which ~25% are from within the cluster and the rest from the other clusters. We corrupted the clean ground-truth adjacency matrix with 20% noise affecting both the sign of $W_{ij}$ and the certainty (i.e., $|W_{ij}|$). Overall the resulting percentage of positive (sub-modular) connections is ~30%.

We report several measurements for these experiments: run-time, energy ($\mathcal{E}_{CC}$), purity of the resulting clusters and the recovered number of clusters k for each of the algorithms as a function of the sparsity of the matrix W, i.e., percent of non-zero entries. Each experiment was repeated 10 times with different randomly generated matrices.
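The text does not spell out how purity is computed. A common definition, assumed in the sketch below, credits each recovered cluster with its most frequent ground-truth label and divides the total by the number of points. The function name `purity` is illustrative.

```python
from collections import Counter


def purity(predicted, truth):
    """Fraction of points that carry the majority ground-truth label of their cluster."""
    per_cluster = {}
    for p, t in zip(predicted, truth):
        per_cluster.setdefault(p, Counter())[t] += 1
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / len(predicted)


truth = [1, 1, 2, 2]
print(purity([1, 1, 2, 2], truth))  # perfect recovery
print(purity([1, 1, 1, 1], truth))  # everything merged into one cluster
print(purity([1, 2, 3, 4], truth))  # every point in its own cluster
```

Under this definition, over-segmentation into singletons is trivially pure, which is one reason to report the recovered k alongside purity.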

Fig. 2 shows results of the synthetic experiments. Existing multi-label approaches (TRW-S and BP) do not perform well: higher $\mathcal{E}_{CC}$, lower purity and incorrect recovery of k. This demonstrates the difficulty of the energy minimization problem that has no unary term and many non sub-modular pair-wise terms. These results are in accordance with the observations of Kolmogorov and Wainwright [2005] when the energy is hard to optimize.

For our large move making algorithms, Expand-and-Explore provides marginally better clustering results than Swap-and-Explore. However, its relatively slow running time makes it infeasible for large CC problems (footnote 4). A somewhat surprising result of these experiments is that for matrices that are not too sparse (above 10%), adaptive-label ICM performs very well. In fact, it is significantly faster than all the other methods and manages to converge to the correct number of clusters with high purity and low energy.

From these experiments we conclude that Swap-and-Explore (Alg. 2) is a very good choice of optimization algorithm for the CC functional. However, when the affinity matrix W is not too sparse, it is worthwhile to try our adaptive-label ICM.

6.2 Co-clustering data 

The following experiment compares our algorithms with a state-of-the-art CC optimization method, R-LP, of Vitaladevuni and Basri [2010]. For this comparison we use
affinity matrices computed for co- segmentation. The co- 



Footnote 2: Matlab implementation available at: http://www.wisdom.weizmann.ac.il/~bagon/matlab.html

Footnote 3: Since these two algorithms work only with a pre-defined number of clusters k, we over-estimate k and report only the number of non-empty clusters in the solution.

Footnote 4: This difference in run time between Expand and Swap can be explained by looking at the number of variables involved in each of the binary steps carried out: for the expand algorithm, each binary step involves almost all the variables, while the binary swap move considers only a small subset of the variables.
[Figure 2 plots: (a) Energy (lower = better); (b) Recovered k (ground truth in dashed); (c) Purity; (d) Run time. Legend: Swap-and-Explore, Expand-and-Explore, ICM, TRW-S, BP.]

Figure 2: Synthetic results: Graphs comparing (a) Energy at convergence, (b) Recovered number of clusters, (c) Purity of resulting clusters, (d) Run time of algorithms (log scale). TRW-S and BP are almost indistinguishable, as are Swap and Expand in most of the plots.
Energy ratio (%)          | 98.6 ±1.4 | 98.4 ±1.9 | 77.4 ±23.9 | 83.8 ±5.4 | 83.6 ±6.3
Strictly lower (> 100%)   | 15%       | 11.7%     |            |           |

Table 1: Comparison to Glasner et al. [2011]: Ratio between our energy and that of Glasner et al. Since all energies are negative, a higher ratio means lower energy. A ratio higher than 100% means our energy is better than that of Glasner et al. The bottom row shows the percentage of cases where each method got strictly lower energy than Glasner et al.



segmentation problem can be formulated as a correlation 
clustering problem with super pixels as the variables [Glasner et al. 2011].

We obtained 77 affinity matrices, courtesy of Glasner et al. [2011], used in their experiments. The number of variables in each matrix ranges from 87 to 788. Their sparsity (percent of non-zero entries) ranges from 6% to 50%, and there are roughly the same number of positive (sub-modular) and negative (non sub-modular) entries.

Table 1 shows the ratio between our energy and the energy of the R-LP method. The table also shows the percentage of matrices for which our algorithms found a solution with lower energy than R-LP. The results show the superiority of our algorithms over existing multi-label energy minimization approaches (TRW-S and BP). Furthermore, our methods are on par with the existing state-of-the-art CC optimization method on small problems. However, unlike existing methods, our algorithms can be applied to problems two orders of magnitude larger. Optimizing directly for U not only does not compromise the performance of our method, but also allows us to handle large scale CC optimization, as demonstrated in the next section.




[Figure 3 panels: (a) Input image and boundary scribbles (red); (b) Resulting segmentation.]

Figure 3: Interactive multi-object segmentation: (a) The user provides only crude and partial indications of the locations of boundaries between the relevant objects in an image (red). (b) The output of our algorithm correctly segments the image into multiple segments. Image was taken from [Alpert et al. 2007].

7 New Applications 

In this section we present two new applications made possible by our large scale CC optimization. Both applications build upon integrating attraction and repulsion information between a large number of points, and require the robust recovery of the underlying number of clusters k.

7.1 Interactive multi-object segmentation (Patent Pending)

Our first experiment demonstrates the ability of our algorithm to handle large scale CC problems (pixel-level segmentation).

Negative affinities in image segmentation may come very 
naturally from boundary information: pixels on the same 
side of a boundary are likely to be in the same segment 
(attraction), while pixels on opposite sides of a bound- 
ary are likely to be in different segments (repulsion). We 
use this observation to design a novel approach to inter- 
active multi-object image segmentation. Instead of using 
[Figure 4 panels: (a) Boundary scribbles → matting constraints; (b) "Soft" segmentations → segment membership vector; (c) Membership correlations → affinity matrix.]

Figure 4: From boundary scribbles to affinity matrix: (a) A boundary scribble is drawn by the user (red), inducing "figure/ground" regions on its opposite sides (black and white regions). (b) For each scribble we use the method of Levin et al. [2008] to generate a soft segmentation of the image into two segments: pixel values in the soft segmentation are in the range [-1, 1]. Pixels far away from the scribble are assigned 0, as it is uncertain to which segment they should belong. Each pixel i is described using a segmentation membership vector $v_i$ with an entry corresponding to its assignment at each soft segmentation (red columns). (c) A non-zero entry $W_{ij}$ in the sparse affinity matrix is the correlation between normalized vectors $v_i$ and $v_j$: $W_{ij} = v_i^T v_j / (\|v_i\| \cdot \|v_j\|)$. We also add strong repulsion across each scribble.



k different "strokes" for the different objects (e.g., Santner et al. [2011]), the user applies a single "brush" to indicate parts of the boundaries between the different objects. Using these sparse and incomplete boundary hints we can correctly complete the boundaries and extract the desired number of segments. Although the user does not provide the number of objects k at any stage, our method is able to automatically detect the number of segments using only the incomplete boundary cues. Fig. 3 provides an example of our novel interactive multi-object segmentation approach.

Computing affinities: Fig. 4 illustrates how we use sporadic user-provided boundary cues to compute a sparse affinity matrix with both positive and negative entries. Note that this is a modification of the affinity computation presented by Stein et al. [2008]: (i) We use the interactive boundary cues to drive the computation, rather than boundaries computed by an unsupervised technique. (ii) We only compute a small fraction of all entries of the matrix, as opposed to the full matrix of Stein et al. (iii) Most importantly, we end up with both positive and negative affinities, in contrast to Stein et al., who use only positive affinities.
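The affinity between two pixels is the correlation of their normalized membership vectors, i.e., $v_i^T v_j / (\|v_i\| \cdot \|v_j\|)$, which is positive when the pixels fall on the same side of the scribbles and negative when they fall on opposite sides. The sketch below computes a single entry; the function name and the convention of returning zero for an all-zero membership vector (a pixel far from every scribble) are assumptions.

```python
from math import sqrt


def membership_affinity(vi, vj):
    """Normalized correlation of two membership vectors.

    Returns 0.0 if either vector is all-zero (an assumed convention for
    pixels that are far from every scribble).
    """
    norm_i = sqrt(sum(a * a for a in vi))
    norm_j = sqrt(sum(b * b for b in vj))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return sum(a * b for a, b in zip(vi, vj)) / (norm_i * norm_j)


# Hypothetical membership vectors with one entry per scribble, values in [-1, 1].
print(membership_affinity([0.9, 0.0], [0.8, 0.1]))    # same side: attraction
print(membership_affinity([0.9, 0.0], [-0.7, 0.0]))   # opposite sides: repulsion
print(membership_affinity([0.0, 0.0], [0.5, -0.5]))   # uncertain pixel: no evidence
```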

The sparse affinity matrix W is very large (~100K × 100K). Existing methods for optimizing the correlation clustering functional are unable to handle a matrix of this size. We applied our Swap-and-Explore algorithm (Alg. 2) to this problem, and it provides visually good results with only several minutes of processing per image.

Fig. 5 shows input images and user-marked boundary cues used for computing the affinity matrix. Our results are shown in the bottom row.

The new interface allows the user to segment the image into several coherent segments without changing brushes and without explicitly specifying the number of desired segments to the algorithm.



7.2 Clustering and face identification 

Our next experiment shows that detecting the underlying number of clusters k can be an important task on its own. Given a collection of face images, we expect the different clusters to correspond to different persons. Identifying the different people requires not only high purity of the resulting clusters but, more importantly, correctly discovering the appropriate number of clusters. This experiment is an extension of existing work on the problem of "same/not-same" learning. Following recent metric learning approaches (e.g., [Guillaumin et al. 2009; Guillaumin et al. 2010]), we learn a single classifier that assigns a probability to each pair of faces: "how likely is this pair to be of the same person". Then, using this classifier, we are able to determine the number of persons and cluster the faces of unseen people. That is, given a new set of face images of several people never seen before, our clustering approach is able to automatically cluster the images and identify how many different people are in the set.



For this experiment we use the PUT face dataset [Kasinski et al. 2008]. The dataset consists of 9971 images of 100 people (roughly 100 images per person). Images were taken in partially controlled illumination conditions over a uniform background. The main sources of face appearance variation are changes in head pose and facial expression.



We use the same method as Guillaumin et al. [2009] to describe each face. SIFT descriptors are computed at fixed points on the face at multiple scales. We use the annotations provided in the dataset to generate these keypoints. Given a training set of labeled faces $\{x_i, y_i\}_{i=1}^{N}$ we use a
Figure 5: Interactive segmentation results. Input image and user boundary cues (top), our result (bottom). Images were taken from [Alpert et al. 2007].



state-of-the-art method by Guillaumin et al. [2010] to learn a Mahalanobis distance L and threshold b such that:

$$\Pr(y_i = y_j \mid x_i, x_j; L, b) = \sigma\left(b - (x_i - x_j)^T L^T L (x_i - x_j)\right)$$

where $\sigma(z) = (1 + e^{-z})^{-1}$ is the sigmoid function.

For each experiment we chose k people for test (roughly 100 · k images), and used the images of the other 100 - k people for training. The learned distance is then used to compute $p_{ij}$, the probability that faces i and j belong to the same person, for all pairs of face images of the k people in the test set. The affinities are set to $W_{ij} = \log \frac{p_{ij}}{1 - p_{ij}}$. We apply our clustering algorithm to search for an optimal partition, and report the identified number of people k' and the purity of the resulting clusters. We experimented with k = 15, 20, ..., 35. For each k we repeated the experiments for several different choices of k different persons.
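The mapping from the pairwise classifier to signed affinities can be sketched as follows: a probability above one half becomes attraction and a probability below one half becomes repulsion, with magnitude growing with confidence. The function names, the representation of L as a list of rows, and the clipping constant `eps` (used only to avoid taking the logarithm of zero) are assumptions for illustration.

```python
from math import exp, log


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def same_person_probability(xi, xj, L, b):
    """sigma(b - ||L (xi - xj)||^2), with the matrix L given as a list of rows."""
    diff = [a - c for a, c in zip(xi, xj)]
    projected = [sum(l * d for l, d in zip(row, diff)) for row in L]
    return sigmoid(b - sum(v * v for v in projected))


def log_odds_affinity(p, eps=1e-12):
    """W_ij = log(p / (1 - p)); p is clipped to [eps, 1 - eps] (an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return log(p / (1.0 - p))


# Hypothetical two-dimensional descriptors, identity metric, threshold b = 1.
L = [[1.0, 0.0], [0.0, 1.0]]
for xj in ([0.1, 0.0], [2.0, 0.0]):
    p = same_person_probability([0.0, 0.0], xj, L, b=1.0)
    print(xj, p, log_odds_affinity(p))
```

With a sigmoid classifier the log odds reduce to the classifier's own argument, so the affinity is simply the threshold b minus the learned squared distance.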

In these settings all our algorithms performed roughly the same in terms of recovering k and the purity of the resulting clustering. However, in terms of running time, adaptive-label ICM completed the task significantly faster than the other methods. We compare Swap-and-Explore to two different approaches: (i) Connected components: Thresholding the matrix of probabilities $p_{ij}$ induces k' connected components. Each such component should correspond to a different person. In each experiment we tried 10 threshold values and reported the best result. (ii) Spectral gap: Treating the probabilities matrix as a positive affinity matrix, we use NCuts [Shi and Malik 2000] to cluster the faces. For this method the number of clusters k' is determined according to the spectral gap: Let $\lambda_i$ be the i-th largest eigenvalue of the normalized Laplacian matrix; the number of clusters is then $k' = \arg\max_i \frac{\lambda_i}{\lambda_{i+1}}$.
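The connected-components baseline can be sketched with a small union-find structure: pairs whose probability exceeds a threshold are linked, and the number of resulting components is the estimate k'. The function name, the toy matrix and the particular thresholds are assumptions; the sketch also shows how strongly the estimate depends on the threshold.

```python
def threshold_components(P, t):
    """Number of connected components of the graph with an edge where P[i][j] > t."""
    n = len(P)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if P[i][j] > t:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


# Hypothetical pairwise "same person" probabilities for three faces.
P = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
for t in (0.05, 0.5, 0.95):
    print(t, threshold_components(P, t))
```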



Fig. 6 shows cluster purity and the number of different persons k' identified as a function of the actual number of people k for the different methods. Our method succeeds in identifying roughly the correct number of people (dashed black line) for all sizes of test sets, and maintains relatively high purity values.
Figure 6: Face identification: Graphs showing our result (Swap), spectral and connected components. Left: recovered number of people (k') vs. number of people in the test set. Dashed line shows the true number of people. Right: purity of resulting clusters.



trinsic "model selection" capability. Using a generative 
probabilistic formulation allows for a better understanding of the functional and the underlying assumptions it makes, including the prior it imposes on the solution.

Apart from establishing theoretic aspects of the CC func- 
tional, this work also suggests a new perspective on the 
functional, viewing it as a discrete Potts energy. The result- 
ing energy minimization presents three challenges: (i) the 
energy is non sub-modular, (ii) the number of clusters is not 
known in advance, and (iii) there is no unary term. We pro- 
posed new energy minimization algorithms that can suc- 
cessfully cope with these challenges. 

Optimizing large scale CC and robustly recovering the un- 
derlying number of clusters allows us to propose new ap- 
plications: interactive multi-label image segmentation and 
unsupervised face identification. 